NSF PAR Search | NSF Public Access Repository

Fast and Accurate DNN Performance Estimation across Diverse Hardware Platforms

https://doi.org/10.1109/MASCOTS64422.2024.10786578

Kakrannaya, Vishwas Vasudeva; Rai, Siddhartha Balakrishna; Sivasubramaniam, Anand; Zhu, Timothy (October 2024, IEEE)

Full Text Available
Pirate: No Compromise Low-Bandwidth VR Streaming for Edge Devices

https://doi.org/10.1145/3676641.3716268

Zhang, Yingtian; Kang, Yan; Ying, Ziyu; Lu, Wanhang; Lan, Sijie; Xu, Huijuan; Maeng, Kiwan; Sivasubramaniam, Anand; Kandemir, Mahmut T; Das, Chita R (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving

https://doi.org/10.1145/3589974

Kumar, Adithya; Sivasubramaniam, Anand; Zhu, Timothy (May 2023, Proceedings of the ACM Conference on Measurement and Analysis of Computing Systems)

The growing adoption of hardware accelerators driven by their intelligent compiler and runtime system counterparts has democratized ML services and precipitously reduced their execution times. This motivates us to shift our attention to efficiently serve these ML services under distributed settings and characterize the overheads imposed by the RPC mechanism ('RPC tax') when serving them on accelerators. The RPC implementations designed over the years implicitly assume the host CPU services the requests, and we focus on expanding such works towards accelerator-based services. While recent proposals calling for SmartNICs to take on this task are reasonable for simple kernels, serving complex ML models requires a more nuanced view to optimize both the data-path and the control/orchestration of these accelerators. We program today's commodity network interface cards (NICs) to split the control and data paths for effective transfer of control while efficiently transferring the payload to the accelerator. As opposed to unified approaches that bundle these paths together, limiting the flexibility in each of these paths, we design and implement SplitRPC - a control + data path optimizing RPC mechanism for ML inference serving. SplitRPC allows us to optimize the datapath to the accelerator while simultaneously allowing the CPU to maintain full orchestration capabilities. We implement SplitRPC on both commodity NICs and SmartNICs and demonstrate how GPU-based ML services running different compiler/runtime systems can benefit. For a variety of ML models served using different inference runtimes, we demonstrate that SplitRPC is effective in minimizing the RPC tax while providing significant gains in throughput and latency over existing kernel by-pass approaches, without requiring expensive SmartNIC devices.
SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving

https://doi.org/10.1145/3578338.3593571

Kumar, Adithya; Sivasubramaniam, Anand; Zhu, Timothy (January 2023, Proceedings of the ACM on measurement and analysis of computing systems)

The growing adoption of hardware accelerators driven by their intelligent compiler and runtime system counterparts has democratized ML services and precipitously reduced their execution times. This motivates us to shift our attention to efficiently serve these ML services under distributed settings and characterize the overheads imposed by the RPC mechanism (‘RPC tax’) when serving them on accelerators. The RPC implementations designed over the years implicitly assume the host CPU services the requests, and we focus on expanding such works towards accelerator-based services. While recent proposals calling for SmartNICs to take on this task are reasonable for simple kernels, serving complex ML models requires a more nuanced view to optimize both the data-path and the control/orchestration of these accelerators. We program today’s commodity network interface cards (NICs) to split the control and data paths for effective transfer of control while efficiently transferring the payload to the accelerator. As opposed to unified approaches that bundle these paths together, limiting the flexibility in each of these paths, we design and implement SplitRPC - a {control + data} path optimizing RPC mechanism for ML inference serving. SplitRPC allows us to optimize the datapath to the accelerator while simultaneously allowing the CPU to maintain full orchestration capabilities. We implement SplitRPC on both commodity NICs and SmartNICs and demonstrate how GPU-based ML services running different compiler/runtime systems can benefit. For a variety of ML models served using different inference runtimes, we demonstrate that SplitRPC is effective in minimizing the RPC tax while providing significant gains in throughput and latency over existing kernel by-pass approaches, without requiring expensive SmartNIC devices.
Full Text Available
Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

https://doi.org/10.1109/MICRO61859.2024.00091

Jain, Rishabh; Bhasi, Vivek M; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut T; Das, Chita R (November 2024, IEEE)

Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are being increasingly preferred for executing DLRM inference. However, serving newer DLRMs, while meeting acceptable latencies, continues to remain challenging, making traditional deployments increasingly more GPU-hungry, resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading up to a 3.2× embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, pushing the performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play-based software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
Full Text Available
Optimizing CPU Performance for Recommendation Systems At-Scale

https://doi.org/10.1145/3579371.3589112

Jain, Rishabh; Cheng, Scott; Kalagi, Vishwas; Sanghavi, Vrushabh; Kaul, Samvit; Arunachalam, Meena; Maeng, Kiwan; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut Taylan; et al (June 2023, International Symposium on Computer Architecture 2023)

Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to the data-center AI cycles. Due to the high computational and memory bandwidth needs of DLRMs, specifically the embedding stage in DLRM inferences, both CPUs and GPUs are used for hosting such workloads. This is primarily because of the heavy irregular memory accesses in the embedding stage of computation that leads to significant stalls in the CPU pipeline. As the model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby, bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) Software prefetching, to hide the memory access latency suffered by the demand loads and (2) Overlapping computation and memory accesses, to reduce CPU stalls via hyperthreading to minimize the overall execution time. We evaluate our work on a single-core and 24-core configuration with the latest recommendation models and recently released production traces. Our integrated techniques speed up the inference by up to 1.59x, and on average by 1.4x.
Overflowing emerging neural network inference tasks from the GPU to the CPU on heterogeneous servers

https://doi.org/10.1145/3534056.3534935

Kumar, Adithya; Sivasubramaniam, Anand; Zhu, Timothy (June 2022, Proceedings of the 15th ACM International Conference on Systems and Storage)

Full Text Available
IoTRepair: Flexible Fault Handling in Diverse IoT Deployments

https://doi.org/10.1145/3532194

Norris, Michael; Celik, Z. Berkay; Venkatesh, Prasanna; Zhao, Shulin; McDaniel, Patrick; Sivasubramaniam, Anand; Tan, Gang (August 2022, ACM Transactions on Internet of Things)

IoT devices can be used to complete a wide array of physical tasks, but due to factors such as low computational resources and distributed physical deployment, they are susceptible to a wide array of faulty behaviors. Many devices deployed in homes, vehicles, industrial sites, and hospitals carry a great risk of damage to property, harm to a person, or breach of security if they behave faultily. We propose a general fault handling system named IoTRepair, which shows promising results for effectiveness with limited latency and power overhead in an IoT environment. IoTRepair dynamically organizes and customizes fault-handling techniques to address the unique problems associated with heterogeneous IoT deployments. We evaluate IoTRepair by creating a physical implementation mirroring a typical home environment to motivate the effectiveness of this system. Our evaluation showed that each of our fault-handling functions could be completed within 100 milliseconds after fault identification, which is a fraction of the time that state-of-the-art fault-identification methods take (measured in minutes). The power overhead is equally small, with the computation and device action consuming less than 30 milliwatts. This evaluation shows that IoTRepair not only can be deployed in a physical system, but offers significant benefits at a low overhead.
Full Text Available
To move or not to move?: page migration for irregular applications in over-subscribed GPU memory systems with DynaMap

https://doi.org/10.1145/3456727.3463766

Chang, Chia-Hao; Kumar, Adithya; Sivasubramaniam, Anand (June 2021, Proceedings of the 14th ACM International Conference on Systems and Storage)
Full Text Available
Mediating Power Struggles on a Shared Server

Narayanan, Iyswarya; Sivasubramaniam, Anand (July 2020, 2020 IEEE International Symposium on Performance Analysis of Systems and Software)

Full Text Available
